Results 1 - 20 of 31
1.
Biom J ; 66(1): e2300077, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37857533

ABSTRACT

P-values that are derived from continuously distributed test statistics are typically uniformly distributed on (0,1) under least favorable parameter configurations (LFCs) in the null hypothesis. Conservativeness of a p-value P (meaning that P is under the null hypothesis stochastically larger than uniform on (0,1)) can occur if the test statistic from which P is derived is discrete, or if the true parameter value under the null is not an LFC. To deal with both of these sources of conservativeness, we present two approaches utilizing randomized p-values. We illustrate their effectiveness for testing a composite null hypothesis under a binomial model. We also give an example of how the proposed p-values can be used to test a composite null in group testing designs. We find that the proposed randomized p-values are less conservative than nonrandomized p-values under the null hypothesis, but that they are stochastically not smaller under the alternative. The problem of establishing the validity of randomized p-values has received attention in previous literature. We show that our proposed randomized p-values are valid under various discrete statistical models, in which the distribution of the corresponding test statistic belongs to an exponential family. The behavior of the power function for the tests based on the proposed randomized p-values as a function of the sample size is also investigated. Simulations and a real data example are used to compare the different considered p-values.
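The core construction can be sketched in a few lines: for the one-sided binomial test of H0: p ≤ p0, the randomized p-value P(X > x) + U·P(X = x) with U ~ Uniform(0,1) is exactly uniform under the least favorable configuration p = p0, while the classical p-value P(X ≥ x) is conservative. A minimal illustration (all parameter values are hypothetical, not taken from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def randomized_binom_pvalue(x, n, p0, u):
    """Randomized p-value for H0: p <= p0 vs. H1: p > p0 with X ~ Bin(n, p).

    P(X > x) + u * P(X = x) is exactly Uniform(0, 1) when p = p0 (the LFC),
    which removes the conservativeness caused by the discreteness of X.
    """
    return stats.binom.sf(x, n, p0) + u * stats.binom.pmf(x, n, p0)

# Under the LFC the randomized p-values are uniform, while the classical
# p-value P(X >= x) is stochastically larger (conservative).
n, p0, m = 25, 0.3, 20000
x = rng.binomial(n, p0, size=m)
u = rng.uniform(size=m)
p_rand = randomized_binom_pvalue(x, n, p0, u)
p_nonrand = stats.binom.sf(x - 1, n, p0)       # P(X >= x)

print(np.mean(p_rand <= 0.05))      # close to the nominal 0.05
print(np.mean(p_nonrand <= 0.05))   # below 0.05
```

The randomized rejection region always contains the nonrandomized one, so the randomized test exhausts the level that the discrete test leaves on the table.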


Subjects
Models, Statistical; Sample Size
2.
Stat Med ; 42(17): 2944-2961, 2023 07 30.
Article in English | MEDLINE | ID: mdl-37173292

ABSTRACT

Modern high-throughput biomedical devices routinely produce data on a large scale, and the analysis of high-dimensional datasets has become commonplace in biomedical studies. However, given thousands or tens of thousands of measured variables in these datasets, extracting meaningful features poses a challenge. In this article, we propose a procedure to evaluate the strength of the associations between a nominal (categorical) response variable and multiple features simultaneously. Specifically, we propose a framework of large-scale multiple testing under arbitrary correlation dependency among test statistics. First, marginal multinomial regressions are performed for each feature individually. Second, we use an approach of multiple marginal models for each baseline-category pair to establish asymptotic joint normality of the stacked vector of the marginal multinomial regression coefficients. Third, we estimate the (limiting) covariance matrix between the estimated coefficients from all marginal models. Finally, our approach approximates the realized false discovery proportion of a thresholding procedure for the marginal p-values for each baseline-category logit pair. The proposed approach offers a sensible trade-off between the expected numbers of true and false findings. Furthermore, we demonstrate a practical application of the method on hyperspectral imaging data. This dataset is obtained by a matrix-assisted laser desorption/ionization (MALDI) instrument. MALDI demonstrates tremendous potential for clinical diagnosis, particularly for cancer research. In our application, the nominal response categories represent cancer (sub-)types.


Subjects
Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization; Humans; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods; Statistics as Topic
3.
PLoS One ; 18(2): e0280503, 2023.
Article in English | MEDLINE | ID: mdl-36724145

ABSTRACT

We present a new approach to modeling the future development of extreme temperatures globally and on the time-scale of several centuries by using non-stationary generalized extreme value distributions in combination with logistic functions. The statistical models we propose are applied to annual maxima of daily temperature data from fully coupled climate models spanning the years 1850 through 2300. They enable us to investigate how extremes will change depending on the geographic location not only in terms of the magnitude, but also in terms of the timing of the changes. We find that in general, changes in extremes are stronger and more rapid over land masses than over oceans. In addition, our statistical models allow for changes in the different parameters of the fitted generalized extreme value distributions (a location, a scale and a shape parameter) to take place independently and at varying time periods. Different statistical models are presented and the Bayesian Information Criterion is used for model selection. It turns out that in most regions, changes in mean and variance take place simultaneously while the shape parameter of the distribution is predicted to stay constant. In the Arctic region, however, a different picture emerges: There, climate variability is predicted to increase rather quickly in the second half of the twenty-first century, probably due to the melting of ice, whereas changes in the mean values take longer and come into effect later.
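A toy version of such a model can be written down directly: below, annual maxima for 1850-2300 are simulated from a GEV whose location parameter follows a logistic transition, and all parameters are then re-estimated by maximum likelihood. The functional form, parameter values, and optimizer settings are illustrative assumptions, not those of the paper (note that scipy's `genextreme` uses the opposite sign convention for the shape parameter):

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)

# Hypothetical setup: the GEV location drifts along a logistic curve
# mu(t) = mu0 + d / (1 + exp(-k (t - t0))); scale and shape stay constant.
years = np.arange(1850, 2301)
true = dict(mu0=30.0, d=4.0, k=0.05, t0=2050.0, sigma=1.5, xi=-0.1)

def mu(t, mu0, d, k, t0):
    return mu0 + d / (1.0 + np.exp(-k * (t - t0)))

# scipy's genextreme uses c = -xi relative to the climate convention
maxima = stats.genextreme.rvs(
    c=-true["xi"],
    loc=mu(years, true["mu0"], true["d"], true["k"], true["t0"]),
    scale=true["sigma"], random_state=rng)

def nll(theta):
    """Negative log-likelihood of the non-stationary GEV model."""
    m0, d, k, t0, sigma, xi = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(stats.genextreme.logpdf(
        maxima, c=-xi, loc=mu(years, m0, d, k, t0), scale=sigma))

start = np.array([29.0, 3.0, 0.03, 2040.0, 1.0, 0.0])
res = optimize.minimize(nll, start, method="Nelder-Mead",
                        options={"maxiter": 4000})
print(res.x)   # estimates of (mu0, d, k, t0, sigma, xi)
```

In practice one would fit several such models (e.g. logistic trends in the scale as well) and compare them via an information criterion, as the abstract describes with the BIC.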


Subjects
Climate; Models, Statistical; Temperature; Bayes Theorem; Climate Change
4.
Biom J ; 65(2): e2100328, 2023 02.
Article in English | MEDLINE | ID: mdl-36029271

ABSTRACT

Large-scale hypothesis testing has become a ubiquitous problem in high-dimensional statistical inference, with broad applications in various scientific disciplines. One relevant application is constituted by imaging mass spectrometry (IMS) association studies, where a large number of tests are performed simultaneously in order to identify molecular masses that are associated with a particular phenotype, for example, a cancer subtype. Mass spectra obtained from matrix-assisted laser desorption/ionization (MALDI) experiments are dependent when considered as statistical quantities. False discovery proportion (FDP) estimation and control under arbitrary dependency structure among test statistics is an active topic in modern multiple testing research. In this context, we are concerned with the evaluation of associations between the binary outcome variable (describing the phenotype) and multiple predictors derived from MALDI measurements. We propose an inference procedure in which the correlation matrix of the test statistics is utilized. The approach is based on multiple marginal models. Specifically, we fit a marginal logistic regression model for each predictor individually. Asymptotic joint normality of the stacked vector of the marginal regression coefficients is established under standard regularity assumptions, and their (limiting) correlation matrix is estimated. The proposed method extracts common factors from the resulting empirical correlation matrix. Finally, we estimate the realized FDP of a thresholding procedure for the marginal p-values. We demonstrate a practical application of the proposed workflow to MALDI IMS data in an oncological context.


Subjects
Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization; Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods
5.
Biom J ; 64(2): 197, 2022 02.
Article in English | MEDLINE | ID: mdl-35152458
6.
Biom J ; 64(2): 384-409, 2022 02.
Article in English | MEDLINE | ID: mdl-33464615

ABSTRACT

We are concerned with testing replicability hypotheses for many endpoints simultaneously. This constitutes a multiple test problem with composite null hypotheses. Traditional p-values, which are computed under least favorable parameter configurations (LFCs), are over-conservative in the case of composite null hypotheses. As demonstrated in prior work, this poses severe challenges in the multiple testing context, especially when one goal of the statistical analysis is to estimate the proportion π0 of true null hypotheses. Randomized p-values have been proposed to remedy this issue. In the present work, we discuss the application of randomized p-values in replicability analysis. In particular, we introduce a general class of statistical models for which valid, randomized p-values can be calculated easily. By means of computer simulations, we demonstrate that their usage typically leads to a much more accurate estimation of π0 than the LFC-based approach. Finally, we apply our proposed methodology to a real data example from genomics.
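The distortion of π0 estimation by LFC-based p-values is easy to reproduce with a standard Schweder-Spjøtvoll-type estimator: conservative p-values pile up near 1 and inflate the estimate, while randomized p-values do not. A hedged sketch with hypothetical parameters (the binomial setting and the estimator choice are illustrative, not the paper's exact setup):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n, p0 = 20, 0.25           # H0: p <= p0, tested with X ~ Bin(n, p)
m0, m1 = 1500, 500         # true nulls (at the LFC) and alternatives
x = np.concatenate([rng.binomial(n, p0, m0),      # nulls
                    rng.binomial(n, 0.55, m1)])   # alternatives

p_lfc = stats.binom.sf(x - 1, n, p0)              # classical P(X >= x)
p_rnd = (stats.binom.sf(x, n, p0)
         + rng.uniform(size=x.size) * stats.binom.pmf(x, n, p0))

def pi0_estimate(p, lam=0.5):
    """Schweder-Spjotvoll-type estimator: #{p > lam} / ((1 - lam) * m)."""
    return np.mean(p > lam) / (1.0 - lam)

est_lfc = pi0_estimate(p_lfc)   # inflated by conservativeness
est_rnd = pi0_estimate(p_rnd)   # close to the true pi0 = 0.75
print(est_lfc, est_rnd)
```

The estimator only uses the p-values above λ, which is exactly the region where conservative null p-values over-accumulate; this is why the LFC-based estimate overshoots the true proportion of 0.75.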


Subjects
Genomics; Models, Statistical; Computer Simulation
7.
Biom J ; 61(1): 40-61, 2019 01.
Article in English | MEDLINE | ID: mdl-30003587

ABSTRACT

Multivariate multiple test procedures have received growing attention recently. This is due to the fact that data generated by modern applications typically are high-dimensional, but possess pronounced dependencies due to the technical mechanisms involved in the experiments. Hence, it is possible and often necessary to exploit these dependencies in order to achieve reasonable power. In the present paper, we express dependency structures in the most general manner, namely, by means of copula functions. One class of nonparametric copula estimators is constituted by Bernstein copulae. We extend previous statistical results regarding bivariate Bernstein copulae to the multivariate case and study their impact on multiple tests. In particular, we utilize them to derive asymptotic confidence regions for the family-wise error rate (FWER) of multiple test procedures that are empirically calibrated by making use of Bernstein copulae approximations of the dependency structure among the test statistics. This extends a similar approach by Stange et al. (2015) in the parametric case. A simulation study quantifies the gain in FWER level exhaustion and, consequently, power that can be achieved by exploiting the dependencies, in comparison with common threshold calibrations like the Bonferroni or Šidák corrections. Finally, we demonstrate an application of the proposed methodology to real-life data from insurance.
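The payoff of dependence-aware calibration can be illustrated with a simple stand-in: below, an equicorrelated Gaussian model plays the role of the (in the paper, Bernstein-copula-estimated) dependency structure, and a common two-sided threshold is calibrated from simulated maxima. All numbers are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

m, rho, alpha = 50, 0.6, 0.05

# Equicorrelated Gaussian model as a stand-in for the dependency
# structure that the paper would estimate via Bernstein copulae.
cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
L = np.linalg.cholesky(cov)
sims = rng.standard_normal((20000, m)) @ L.T

# Calibrate a common two-sided threshold from the simulated maxima ...
max_abs = np.max(np.abs(sims), axis=1)
t_calibrated = np.quantile(max_abs, 1.0 - alpha)

# ... and compare with dependence-agnostic corrections.
t_bonferroni = stats.norm.ppf(1.0 - alpha / (2 * m))
t_sidak = stats.norm.ppf(1.0 - (1.0 - (1.0 - alpha) ** (1.0 / m)) / 2.0)

print(t_calibrated, t_sidak, t_bonferroni)
```

With strong positive correlation the calibrated threshold lies clearly below the Šidák and Bonferroni cutoffs: the dependence-aware test exhausts the FWER level and therefore gains power.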


Subjects
Biometry/methods; Calibration; Multivariate Analysis; Statistics, Nonparametric
8.
Sci Rep ; 6: 36671, 2016 11 28.
Article in English | MEDLINE | ID: mdl-27892471

ABSTRACT

The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation into account in a mathematically well-controlled manner. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI achieves higher power and precision than the examined alternatives while yielding fewer false (i.e., non-replicated) and more true (i.e., replicated) discoveries when its results are validated on later GWAS. More than 80% of the discoveries made by COMBI on WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.

9.
PLoS One ; 11(2): e0149016, 2016.
Article in English | MEDLINE | ID: mdl-26914144

ABSTRACT

Signal detection in functional magnetic resonance imaging (fMRI) inherently involves the problem of testing a large number of hypotheses. A popular strategy to address this multiplicity is the control of the false discovery rate (FDR). In this work we consider the case where prior knowledge is available to partition the set of all hypotheses into disjoint subsets or families, e.g., by a priori knowledge on the functionality of certain regions of interest. If the proportion of true null hypotheses differs between families, this structural information can be used to increase statistical power. We propose a two-stage multiple test procedure which first excludes those families from the analysis for which there is no strong evidence for containing true alternatives. We show control of the family-wise error rate at this first stage of testing. Then, at the second stage, we proceed to test the hypotheses within each non-excluded family and obtain asymptotic control of the FDR within each family at this second stage. Our main mathematical result is that this two-stage strategy implies asymptotic control of the FDR with respect to all hypotheses. In simulations we demonstrate the increased power of this new procedure in comparison with established procedures in situations with highly unbalanced families. Finally, we apply the proposed method to simulated and to real fMRI data.
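A stripped-down version of such a two-stage scheme can be sketched as follows: families are first screened (here with a Simes global test at an illustrative cutoff, not the paper's exact stage-1 rule), and the Benjamini-Hochberg procedure is then applied within each surviving family. The data are invented:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def benjamini_hochberg(p, q):
    """Boolean rejection vector of the BH procedure at FDR level q."""
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def simes(p):
    """Simes combination p-value for a family's global null hypothesis."""
    m = p.size
    return np.min(np.sort(p) * m / np.arange(1, m + 1))

# Hypothetical unbalanced families of one-sided z-test p-values:
# family "noise" is pure noise, family "signal" has 400 true effects.
families = {
    "noise": stats.norm.sf(rng.standard_normal(1000)),
    "signal": stats.norm.sf(rng.standard_normal(1000)
                            + np.where(np.arange(1000) < 400, 3.5, 0.0)),
}

rejections = {}
for name, p in families.items():
    if simes(p) > 0.025:      # stage 1: family excluded from testing
        rejections[name] = np.zeros(p.size, dtype=bool)
    else:                     # stage 2: BH within the surviving family
        rejections[name] = benjamini_hochberg(p, q=0.05)

print(rejections["noise"].sum(), rejections["signal"].sum())
```

Excluding near-null families up front spares the signal-rich families from carrying the multiplicity burden of hypotheses that were never going to be rejected.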


Subjects
Image Processing, Computer-Assisted/methods; Magnetic Resonance Imaging; Algorithms; Brain/physiology; False Positive Reactions; Nonlinear Dynamics
10.
Stat Appl Genet Mol Biol ; 14(5): 497-505, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26506100

ABSTRACT

We are concerned with statistical inference for 2 × C × K contingency tables in the context of genetic case-control association studies. Multivariate methods based on asymptotic Gaussianity of vectors of test statistics require information about the asymptotic correlation structure among these test statistics under the global null hypothesis. In the case of C=2, we show that for a wide variety of test statistics this asymptotic correlation structure is given by the standardized linkage disequilibrium matrix of the K loci under investigation. Three popular choices of test statistics are discussed for illustration. In the case of C=3, the standardized composite linkage disequilibrium matrix is the limiting correlation matrix of the K locus-specific Cochran-Armitage trend test statistics.
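The Cochran-Armitage trend test mentioned here has a compact closed form. A sketch for a single 2 × 3 case-control table with the usual additive genotype scores (the example counts are invented):

```python
import numpy as np
from scipy import stats

def cochran_armitage_trend(table, scores=(0, 1, 2)):
    """Cochran-Armitage trend statistic for a 2 x C case-control table.

    Rows are (cases, controls), columns are ordered categories (e.g.
    genotype counts) with the given scores; under the null of no trend
    the statistic is asymptotically standard normal.
    """
    t = np.asarray(table, dtype=float)
    s = np.asarray(scores, dtype=float)
    n = t.sum()
    row = t.sum(axis=1)                  # (cases, controls) totals
    col = t.sum(axis=0)                  # per-category totals
    num = row[1] * (s * t[0]).sum() - row[0] * (s * t[1]).sum()
    var = row[0] * row[1] / n * (n * (s**2 * col).sum()
                                 - (s * col).sum() ** 2)
    return num / np.sqrt(var)

# invented 2 x 3 genotype table: cases in the first row, controls in the second
z = cochran_armitage_trend([[60, 90, 50], [80, 85, 35]])
p = 2 * stats.norm.sf(abs(z))
print(z, p)   # roughly z = 2.37, two-sided p = 0.018
```

Stacking such per-locus statistics into a vector is exactly the setting of the abstract; their limiting correlation matrix is then the (composite) linkage disequilibrium matrix.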


Subjects
Genetic Association Studies; Algorithms; Case-Control Studies; Data Interpretation, Statistical; Genetic Loci; Genetic Predisposition to Disease; Humans; Linkage Disequilibrium; Models, Genetic; Models, Statistical; Normal Distribution; Odds Ratio
11.
Bioinformatics ; 31(22): 3577-83, 2015 Nov 15.
Article in English | MEDLINE | ID: mdl-26249812

ABSTRACT

MOTIVATION: When analyzing a case group of patients with ultra-rare disorders, the ethnicities are often diverse and the data quality might vary. The population substructure in the case group as well as the heterogeneous data quality can cause substantial inflation of test statistics and result in spurious associations in case-control studies if not properly adjusted for. Existing techniques to correct for confounding effects were especially developed for common variants and are not applicable to rare variants. RESULTS: We analyzed strategies to select suitable controls for cases that are based on similarity metrics that vary in their weighting schemes. We simulated different disease entities on real exome data and show that a similarity-based selection scheme can help to reduce false positive associations and to optimize the performance of the statistical tests. Especially when data quality as well as ethnicities vary a lot in the case group, a matching approach that puts more weight on rare variants shows the best performance. We reanalyzed collections of unrelated patients with Kabuki make-up syndrome, Hyperphosphatasia with Mental Retardation syndrome and Catel-Manzke syndrome for which the disease genes were recently described. We show that rare variant association tests are more sensitive and specific in identifying the disease gene than intersection filters and should thus be considered as a favorable approach in analyzing even small patient cohorts. AVAILABILITY AND IMPLEMENTATION: Datasets used in our analysis are available at ftp://ftp.1000genomes.ebi.ac.uk./vol1/ftp/ CONTACT: peter.krawitz@charite.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genetic Association Studies; Genetic Variation; Case-Control Studies; Data Accuracy; Disease/genetics; Ethnicity/genetics; Humans; ROC Curve; Sequence Analysis, DNA
12.
Stat Appl Genet Mol Biol ; 14(4): 347-60, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26215535

ABSTRACT

Genetic association studies lead to simultaneous categorical data analysis. The sample for every genetic locus consists of a contingency table containing the numbers of observed genotype-phenotype combinations. Under case-control design, the row counts of every table are identical and fixed, while column counts are random. The aim of the statistical analysis is to test independence of the phenotype and the genotype at every locus. We present an objective Bayesian methodology for these association tests, which relies on the conjugacy of Dirichlet and multinomial distributions. Being based on the likelihood principle, the Bayesian tests avoid looping over all tables with given marginals. Making use of data generated by The Wellcome Trust Case Control Consortium (WTCCC), we illustrate that the ordering of the Bayes factors shows a good agreement with that of frequentist p-values. Furthermore, we deal with specifying prior probabilities for the validity of the null hypotheses, by taking linkage disequilibrium structure into account and exploiting the concept of effective numbers of tests. Application of a Bayesian decision theoretic multiple test procedure to the WTCCC data illustrates the proposed methodology. Finally, we discuss two methods for reconciling frequentist and Bayesian approaches to the multiple association test problem.
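The Dirichlet-multinomial conjugacy invoked here makes the Bayes factor for such a table available in closed form: with fixed row totals, the marginal likelihoods under "shared column probabilities" (independence) and "row-specific column probabilities" (association) reduce to ratios of Dirichlet normalizing constants. A sketch with uniform Dirichlet priors and invented counts (the paper's exact prior and parameterization may differ):

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_norm(a):
    """log of the Dirichlet normalizing constant B(a)."""
    return gammaln(a).sum() - gammaln(a.sum())

def log_bf_independence(table, alpha=1.0):
    """Log Bayes factor of association vs. independence for a case-control
    table with fixed row totals, via Dirichlet-multinomial conjugacy.

    Under independence one column-probability vector is shared by both
    rows; under association each row has its own vector. With conjugate
    Dirichlet(alpha, ..., alpha) priors, both marginal likelihoods reduce
    to ratios of Dirichlet normalizing constants (the multinomial
    coefficients cancel in the Bayes factor).
    """
    t = np.asarray(table, dtype=float)
    a = np.full(t.shape[1], alpha)
    log_m_assoc = sum(log_dirichlet_norm(a + row) - log_dirichlet_norm(a)
                      for row in t)
    log_m_indep = log_dirichlet_norm(a + t.sum(axis=0)) - log_dirichlet_norm(a)
    return log_m_assoc - log_m_indep

# identical rows: evidence for independence (log BF < 0)
print(log_bf_independence([[50, 100, 50], [50, 100, 50]]))
# strongly discrepant rows: evidence for association (log BF >> 0)
print(log_bf_independence([[120, 60, 20], [20, 60, 120]]))
```

Because only normalizing constants are involved, no loop over all tables with the given marginals is needed, which is the computational point the abstract makes.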


Subjects
Bayes Theorem; Genetic Association Studies; Algorithms; Alleles; Computer Simulation; Genetic Loci; Genotype; Humans; Models, Genetic; Models, Statistical; Phenotype
13.
PLoS One ; 10(5): e0125587, 2015.
Article in English | MEDLINE | ID: mdl-25965389

ABSTRACT

Epigenetic research leads to complex data structures. Since parametric model assumptions for the distribution of epigenetic data are hard to verify, we introduce in the present work a nonparametric statistical framework for two-group comparisons. Furthermore, epigenetic analyses are often performed at various genetic loci simultaneously. Hence, in order to be able to draw valid conclusions for specific loci, an appropriate multiple testing correction is necessary. Finally, with technologies available for the simultaneous assessment of many interrelated biological parameters (such as gene arrays), statistical approaches also need to deal with a possibly unknown dependency structure in the data. Our statistical approach to the nonparametric comparison of two samples with independent multivariate observables is based on recently developed multivariate multiple permutation tests. We adapt their theory in order to cope with families of hypotheses regarding relative effects. Our results indicate that the multivariate multiple permutation test keeps the pre-assigned type I error level for the global null hypothesis. In combination with the closure principle, the family-wise error rate for the simultaneous test of the corresponding locus/parameter-specific null hypotheses can be controlled. In applications we demonstrate that group differences in epigenetic data can be detected reliably with our methodology.
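The flavor of a multivariate permutation test over several endpoints can be sketched with a max-statistic construction: per-endpoint relative effects are estimated from midranks, and FWER-adjusted p-values are obtained from the permutation distribution of the maximum. This is a simplified stand-in for the procedure of the paper, run on invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def relative_effects(x, y):
    """Per-endpoint midrank estimates of the relative effect
    P(X < Y) + 0.5 * P(X = Y), centered at 1/2."""
    n1, n2 = x.shape[0], y.shape[0]
    out = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        ranks = stats.rankdata(np.concatenate([x[:, j], y[:, j]]))
        out[j] = (ranks[n1:].mean() - (n2 + 1) / 2) / n1 - 0.5
    return out

def max_t_permutation_test(x, y, n_perm=2000, rng=rng):
    """FWER-adjusted permutation p-values via the maximum statistic."""
    obs = np.abs(relative_effects(x, y))
    pooled = np.vstack([x, y])
    n1 = x.shape[0]
    max_null = np.empty(n_perm)
    for b in range(n_perm):
        perm = rng.permutation(pooled.shape[0])
        max_null[b] = np.abs(
            relative_effects(pooled[perm[:n1]], pooled[perm[n1:]])).max()
    # adjusted p-value per endpoint: how often the permutation max exceeds it
    return (1 + (max_null[None, :] >= obs[:, None]).sum(axis=1)) / (1 + n_perm)

# two groups, four endpoints; only endpoint 0 carries a location effect
x = rng.standard_normal((30, 4))
y = rng.standard_normal((25, 4))
y[:, 0] += 1.5
p_adj = max_t_permutation_test(x, y)
print(p_adj)   # small for endpoint 0, large for the others
```

Because the permutation distribution is computed jointly across endpoints, the dependency structure of the data is taken into account automatically, without any parametric model.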


Subjects
Computational Biology/methods; Epigenesis, Genetic; Data Interpretation, Statistical; Databases, Genetic; Humans; Statistical Distributions
14.
Front Psychol ; 5: 704, 2014.
Article in English | MEDLINE | ID: mdl-25071670

ABSTRACT

OBJECTIVE: The aim of this longitudinal study was to identify predictors of instantaneous well-being in patients with amyotrophic lateral sclerosis (ALS). Based on flow theory, well-being was expected to be highest when perceived demands and perceived control were in balance, and thinking about the past was expected to be a risk factor for rumination, which would in turn reduce well-being. METHODS: Using the experience sampling method, data on current activities, associated aspects of perceived demands, control, and well-being were collected from 10 patients with ALS three times a day for two weeks. RESULTS: Results show that perceived control was uniformly and positively associated with well-being, but that demands were only positively associated with well-being when they were perceived as controllable. Mediation analysis confirmed thinking about the past, but not thinking about the future, to be a risk factor for rumination and reduced well-being. DISCUSSION: Findings extend our knowledge of factors contributing to well-being in ALS, as not only perceived control but also perceived demands can contribute to well-being. They further show that a focus on present experiences might contribute to increased well-being.

15.
Epigenetics ; 8(11): 1226-35, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24071829

ABSTRACT

The adaptive immune system is involved in tumor establishment and aggressiveness. Tumors of the ovaries, an immune-privileged organ, spread via transcoelomic routes and rarely to distant organs. This is contrary to tumors of non-immune-privileged organs, which often disseminate hematogenously to distant organs. Epigenetics-based immune cell quantification allows direct comparison of the immune status in benign and malignant tissues and in blood. Here, we introduce the "cellular ratio of immune tolerance" (immunoCRIT) as defined by the ratio of regulatory T cells to total T lymphocytes. The immunoCRIT was analyzed on 273 benign tissue samples of colorectal, bronchial, renal and ovarian origin as well as in 808 samples from primary colorectal, bronchial, mammary and ovarian cancers. ImmunoCRIT is strongly increased in all cancerous tissues and rises gradually in strict dependence on tumor aggressiveness. In peripheral blood of ovarian cancer patients, immunoCRIT incrementally increases from primary diagnosis to disease recurrence, at which distant metastases frequently occur. We postulate that non-pathological immunoCRIT values observed in peripheral blood of immune-privileged ovarian tumor patients are sufficient to prevent hematogenous spread at primary diagnosis. Contrarily, non-immune-privileged tumors establish high immunoCRIT in an immunological environment equivalent to the bloodstream and thus spread hematogenously to distant organs. In summary, our data suggest that the immunoCRIT is a powerful marker for tumor aggressiveness and disease dissemination.


Subjects
Biomarkers, Tumor/immunology; Immune Tolerance; Neoplasms/immunology; Neoplasms/pathology; Adult; Aged; Breast Neoplasms/immunology; Breast Neoplasms/pathology; Case-Control Studies; Colorectal Neoplasms/immunology; Colorectal Neoplasms/pathology; Epigenesis, Genetic; Female; Humans; Kidney Neoplasms/immunology; Kidney Neoplasms/pathology; Lung Neoplasms/immunology; Lung Neoplasms/pathology; Middle Aged; Neoplasm Metastasis; Ovarian Neoplasms/immunology; Ovarian Neoplasms/pathology; T-Lymphocytes/immunology; T-Lymphocytes/pathology; Young Adult
16.
Genome Med ; 5(7): 69, 2013.
Article in English | MEDLINE | ID: mdl-23902830

ABSTRACT

With exome sequencing becoming a tool for mutation detection in routine diagnostics, there is an increasing need for platform-independent methods of quality control. We present a genotype-weighted metric that allows comparison of all the variant calls of an exome to a high-quality reference dataset of an ethnically matched population. The exome-wide genotyping accuracy is estimated from the distance to this reference set, and does not require any further knowledge about data generation or the bioinformatics involved. The distances of our metric are visualized by non-metric multidimensional scaling and serve as an intuitive, standardizable score for the quality assessment of exome data.

17.
J Neural Eng ; 10(3): 036025, 2013 Jun.
Article in English | MEDLINE | ID: mdl-23685458

ABSTRACT

OBJECTIVE: In brain-computer interface (BCI) research, systems based on event-related potentials (ERP) are considered particularly successful and robust. This stems in part from the repeated stimulation which counteracts the low signal-to-noise ratio in electroencephalograms. Repeated stimulation leads to an optimization problem, as more repetitions also cost more time. The optimal number of repetitions thus represents a data-dependent trade-off between the stimulation time and the obtained accuracy. Several methods for dealing with this have been proposed as 'early stopping', 'dynamic stopping' or 'adaptive stimulation'. Despite their high potential for BCI systems at the patient's bedside, those methods are typically ignored in current BCI literature. The goal of the current study is to assess the benefit of these methods. APPROACH: This study assesses for the first time the existing methods on a common benchmark of both artificially generated data and real BCI data of 83 BCI sessions, allowing for a direct comparison between these methods in the context of text entry. MAIN RESULTS: The results clearly show the beneficial effect on the online performance of a BCI system, if the trade-off between the number of stimulus repetitions and accuracy is optimized. All assessed methods work very well for data of good subjects, and worse for data of low-performing subjects. Most methods, however, are robust in the sense that they do not reduce the performance below the baseline of a simple no-stopping strategy. SIGNIFICANCE: Since all methods can be realized as a module between the BCI and an application, minimal changes are needed to include these methods into existing BCI software architectures. Furthermore, the hyperparameters of most methods depend to a large extent on only a single variable: the discriminability of the training data. For the convenience of BCI practitioners, the present study proposes linear regression coefficients for directly estimating the hyperparameters from the data based on this discriminability. The data that were used in this publication are made publicly available to benchmark future methods.


Subjects
Algorithms; Brain Mapping/methods; Brain-Computer Interfaces; Brain/physiology; Electroencephalography/methods; Evoked Potentials/physiology; Pattern Recognition, Automated/methods; Artificial Intelligence; Humans; Reproducibility of Results; Sensitivity and Specificity
18.
Biom J ; 55(3): 463-77, 2013 May.
Article in English | MEDLINE | ID: mdl-23378199

ABSTRACT

Connecting multiple testing with binary classification, we derive a false discovery rate-based classification approach for two-class mixture models, where the available data (represented as feature vectors) for each individual comparison take values in R^d for some d ≥ 1 and may exhibit certain forms of autocorrelation. This generalizes previous findings for the independent case in dimension d = 1. Two resulting classification procedures are described which allow for incorporating prior knowledge about class probabilities and for user-supplied weighting of the severity of misclassifying a member of the "0"-class as "1" and vice versa. The key mathematical tools to be employed are multivariate estimation methods for probability density functions or density ratios. We compare the two algorithms with respect to their theoretical properties and with respect to their performance in practice. Computer simulations indicate that they can both successfully be applied to autocorrelated time series data with moving average structure. Our approach was inspired and its practicability will be demonstrated by applications from the field of brain-computer interfacing and the processing of electroencephalography data.
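The density-based idea can be sketched in one dimension: estimate the mixture density with a kernel density estimator and classify by thresholding the resulting local false discovery rate. The mixture, the known null density, and the 0.5 cutoff (the cost-neutral Bayes rule) are illustrative assumptions, not the paper's setting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical two-class mixture in one dimension: class "0" is N(0, 1),
# class "1" is N(2, 1), and the prior probability of class "0" is 0.7.
pi0, n = 0.7, 4000
labels = rng.uniform(size=n) >= pi0            # True = class "1"
x = np.where(labels, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))

# Estimate the mixture density f nonparametrically; the null density f0
# is taken as known here (in practice it, or the ratio f0/f, is estimated).
f_hat = stats.gaussian_kde(x)
lfdr = np.clip(pi0 * stats.norm.pdf(x) / f_hat(x), 0.0, 1.0)

# Classify as "1" where the local false discovery rate is small; the 0.5
# cutoff is the cost-neutral Bayes rule, and asymmetric misclassification
# costs would simply shift this threshold.
predictions = lfdr < 0.5
accuracy = np.mean(predictions == labels)
print(accuracy)
```

Moving the cutoff away from 0.5 implements exactly the user-supplied weighting of the two misclassification types that the abstract describes.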


Subjects
Algorithms; Data Interpretation, Statistical; Models, Statistical; Brain/physiology; Computer Simulation; Electroencephalography/methods; False Positive Reactions; Humans; Signal Processing, Computer-Assisted
19.
Stat Appl Genet Mol Biol ; 11(4)2012 Jul 27.
Article in English | MEDLINE | ID: mdl-22850061

ABSTRACT

We study exact tests for (2 x 2) and (2 x 3) contingency tables, in particular exact chi-squared tests and exact tests of Fisher type. In practice, these tests are typically carried out without randomization, leading to reproducible results but not exhausting the significance level. We discuss that this can lead to methodological and practical issues in a multiple testing framework when many tables are simultaneously under consideration, as in genetic association studies. Realized randomized p-values are proposed as a solution, which is especially useful for data-adaptive (plug-in) procedures. These p-values allow the proportion of true null hypotheses to be estimated much more accurately than their non-randomized counterparts. Moreover, we address the problem of positively correlated p-values for association by considering techniques to reduce multiplicity by estimating the "effective number of tests" from the correlation structure. An algorithm is provided that bundles all these aspects, efficient computer implementations are made available, a small-scale simulation study is presented, and two real data examples are shown.
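For Fisher-type exact tests the realized randomized p-value has the same one-line form as in the binomial case, with the hypergeometric null distribution. A sketch with invented margins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def randomized_fisher_pvalue(table, u):
    """Realized randomized p-value for the one-sided Fisher exact test
    of a 2 x 2 table [[a, b], [c, d]] (alternative: a is large).

    P(A > a) + u * P(A = a) under the hypergeometric null distribution;
    with a realized u ~ Uniform(0, 1) this is exactly Uniform(0, 1)
    under the null, unlike the conservative P(A >= a).
    """
    (a, b), (c, d) = table
    n_row, n_col, n_tot = a + b, a + c, a + b + c + d
    rv = stats.hypergeom(n_tot, n_col, n_row)
    return rv.sf(a) + u * rv.pmf(a)

# simulate null 2 x 2 tables with fixed margins (40 total, row 15, column 20)
m = 5000
pvals = np.empty(m)
for i in range(m):
    a = rng.hypergeometric(20, 20, 15)      # entry (0, 0) under the null
    table = [[a, 15 - a], [20 - a, 5 + a]]
    pvals[i] = randomized_fisher_pvalue(table, rng.uniform())
print(np.mean(pvals <= 0.1))   # close to 0.10
```

Because the realized p-values are uniform under the null, plug-in estimators of the proportion of true nulls behave as in the continuous case, which is the point the abstract makes.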


Subjects
Algorithms; Genetic Association Studies; Case-Control Studies; Chi-Square Distribution; Computational Biology/methods; Computational Biology/standards; Computer Simulation; Gene Expression Profiling; Genetic Association Studies/statistics & numerical data; Genetic Markers/physiology; High-Throughput Screening Assays/methods; High-Throughput Screening Assays/statistics & numerical data; Humans; Random Allocation; Research Design
20.
Nucleic Acids Res ; 40(6): 2426-31, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22127862

ABSTRACT

With the availability of next-generation sequencing (NGS) technology, it is expected that sequence variants may be called on a genomic scale. Here, we demonstrate that a deeper understanding of the distribution of the variant call frequencies at heterozygous loci in NGS data sets is a prerequisite for sensitive variant detection. We model the crucial steps in an NGS protocol as a stochastic branching process and derive a mathematical framework for the expected distribution of alleles at heterozygous loci before measurement, that is, before sequencing. We confirm our theoretical results by analyzing technical replicates of human exome data and demonstrate that the variance of allele frequencies at heterozygous loci is higher than expected by a simple binomial distribution. Due to this high variance, mutation callers relying on binomially distributed priors are less sensitive for heterozygous variants that deviate strongly from the expected mean frequency. Our results also indicate that error rates can be reduced to a greater degree by technical replicates than by increasing sequencing depth.
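The overdispersion mechanism described here is easy to reproduce: model each allele's amplification as a Galton-Watson branching process and draw reads from the amplified pool; the variance of the observed allele fraction then exceeds the plain binomial benchmark. All parameters below (template counts, cycles, efficiency, depth) are invented:

```python
import numpy as np

rng = np.random.default_rng(8)

def amplify(n_molecules, cycles=12, efficiency=0.8, rng=rng):
    """Galton-Watson model of PCR: in each cycle, every molecule is
    copied with the given efficiency."""
    n = n_molecules
    for _ in range(cycles):
        n = n + rng.binomial(n, efficiency)
    return n

# Hypothetical heterozygous locus: a few starting templates per allele,
# amplified independently, then `depth` reads drawn from the pooled product.
reps, start, depth = 3000, 10, 100
fractions = np.empty(reps)
for i in range(reps):
    ref, alt = amplify(start), amplify(start)
    fractions[i] = rng.hypergeometric(ref, alt, depth) / depth

var_observed = fractions.var()
var_binomial = 0.25 / depth    # variance of the fraction under Bin(depth, 0.5)
print(var_observed, var_binomial)   # observed clearly exceeds binomial
```

The extra variance comes entirely from the random amplification of the few starting templates, which is why technical replicates (fresh template draws) reduce it while deeper sequencing of the same library does not.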


Subjects
Gene Frequency; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Alleles; Exome; Heterozygote; Humans; Stochastic Processes